Remove health check and "bad" node concept from ClusterManager #132
base: main
Conversation
a9d1707 to c0e27a6
A concern was raised that our samples should not be demonstrating any form of polling from within an entity workflow loop, even if the poll frequency is low. Instead, we should point to long-running activity patterns.
@@ -229,9 +196,7 @@ def should_continue_as_new(self) -> bool:
    async def run(self, input: ClusterManagerInput) -> ClusterManagerResult:
        self.init(input)
        await workflow.wait_condition(lambda: self.state.cluster_started)
        # Perform health checks at intervals.
        while True:
There is no value in this being a loop anymore, and no value in the wait condition having a timeout (and therefore no value in a sleep interval setting).
Ah, very good point! (Fixed the same bug in the TypeScript version.) I think this actually improves the sample a lot, since it may not be obvious to users that they can implement the main "loop" so simply, and it makes continue-as-new usage clearer and less intimidating.
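For illustration only, here's a minimal sketch of what that simplified main method can look like without the loop. The `ClusterManagerState`/`ClusterManagerInput` stand-ins and the `is_continue_as_new_suggested()` check are assumptions, not the sample verbatim, and the signal/update handlers are omitted:

```python
from dataclasses import dataclass, field

from temporalio import workflow


@dataclass
class ClusterManagerState:
    cluster_started: bool = False
    cluster_shutdown: bool = False
    nodes: dict = field(default_factory=dict)


@dataclass
class ClusterManagerInput:
    state: ClusterManagerState = field(default_factory=ClusterManagerState)


@workflow.defn
class ClusterManagerWorkflow:
    # Signal/update handlers that flip cluster_started/cluster_shutdown and
    # mutate nodes are omitted here; see the sample for those.

    def should_continue_as_new(self) -> bool:
        return workflow.info().is_continue_as_new_suggested()

    @workflow.run
    async def run(self, input: ClusterManagerInput) -> ClusterManagerState:
        self.state = input.state
        await workflow.wait_condition(lambda: self.state.cluster_started)

        # No loop, no timeout, no sleep interval: just block until we are
        # shutting down or continue-as-new is warranted.
        await workflow.wait_condition(
            lambda: self.state.cluster_shutdown or self.should_continue_as_new()
        )
        if self.should_continue_as_new():
            workflow.continue_as_new(ClusterManagerInput(state=self.state))
        return self.state
```

The comments below question exactly this shape: whether the re-check of should_continue_as_new() after the wait can disagree with the value seen inside the lambda.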
        )
        if self.should_continue_as_new():
I will say this is a bit confusing to a user who may not understand tasks and how the nodes lock is handled. And I may have been wrong about removing the loop. Basically the interesting question is: can you guarantee that if should_continue_as_new() is True in the wait condition lambda, it is also True here? The only way it could become True in the wait condition but False here is if nodes_lock went from unlocked to locked in the interim. The nodes_lock does not surround the entire handlers; it's only acquired after cluster started, and we don't really provide known coroutine ordering (only deterministic ordering).
Take this scenario:

1. Workflow has long since been running and has a lot of history.
2. Handler is run and it awaits the wait condition for cluster started (which will long since have been true).
3. Event loop runs and hits both wait conditions in order, setting both futures to true (the should_continue_as_new wait condition is true because nodes_lock is not locked, and cluster started is true).
4. Now maybe the handler acquires the lock and begins to run.
5. Now maybe the workflow function here runs and now should_continue_as_new() is False.

May need the loop back, or need to split the cluster-shutdown and should-continue-as-new wait conditions into separate tasks so you know which one completed.
Another scenario: what if steps 4 and 5 above switch? What if this workflow thinks it should continue-as-new, but then, as coroutines settle post-completion, the update runs? We'll continue-as-new in the middle of its activities. This needs that new all_handlers_finished thing, I think.
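A rough sketch of the "separate tasks" option, reusing the stand-in names from the earlier sketch (this is not the sample's actual code):

```python
import asyncio

from temporalio import workflow


class ClusterManagerWorkflow:  # fragment; decorators/handlers as in the sample
    async def wait_for_shutdown_or_can(self) -> str:
        """Return which wait condition actually completed first."""
        shutdown = asyncio.create_task(
            workflow.wait_condition(lambda: self.state.cluster_shutdown)
        )
        can = asyncio.create_task(
            workflow.wait_condition(lambda: self.should_continue_as_new())
        )
        done, pending = await asyncio.wait(
            [shutdown, can], return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        return "shutdown" if shutdown in done else "continue_as_new"
```

Even with this, the caller still has to tolerate should_continue_as_new() flipping back to False by the time the main coroutine resumes, which is the ordering problem described in the scenario above.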
Right, I'm trying to move too quickly here. Reinstated the loop, and added a wait for all_handlers_finished() before CAN. It would be ideal to add some tests exercising CAN.
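For reference, the reinstated loop plus handler drain is roughly this shape (again a sketch using the stand-in types from the earlier example, not the exact committed code):

```python
from temporalio import workflow


class ClusterManagerWorkflow:  # fragment; see the sample for the full class
    @workflow.run
    async def run(self, input: ClusterManagerInput) -> ClusterManagerState:
        self.state = input.state
        await workflow.wait_condition(lambda: self.state.cluster_started)
        while True:
            await workflow.wait_condition(
                lambda: self.state.cluster_shutdown
                or self.should_continue_as_new()
            )
            if self.state.cluster_shutdown:
                break
            if self.should_continue_as_new():
                # Drain in-flight signal/update handlers so we never
                # continue-as-new in the middle of a handler's activities.
                await workflow.wait_condition(
                    lambda: workflow.all_handlers_finished()
                )
                workflow.continue_as_new(ClusterManagerInput(state=self.state))
            # Otherwise the condition flipped back; loop and wait again.
        return self.state
```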
This is now blocked on an
The purpose is to demonstrate how to use a mutex to protect against interleaving between the handlers and the main coroutine, and to show something that solves a real-world problem where events come both from the workflow itself and from outside. Is there an alternate way to code up a health check that wouldn't involve polling?
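For readers following the thread, the mutex pattern being referred to is roughly this shape; nodes_lock is an asyncio.Lock on the workflow instance, and assign_nodes_to_job, its parameters, and the "assign_nodes" activity name are placeholders rather than the sample's exact API:

```python
import asyncio
from datetime import timedelta

from temporalio import workflow


class ClusterManagerWorkflow:  # fragment; full class lives in the sample
    @workflow.update
    async def assign_nodes_to_job(self, job_name: str, count: int) -> list[str]:
        await workflow.wait_condition(lambda: self.state.cluster_started)
        # The mutex keeps this handler's read-modify-write of node state from
        # interleaving with other handlers and with the main coroutine.
        async with self.nodes_lock:
            free = [n for n, job in self.state.nodes.items() if job is None]
            assigned = free[:count]
            await workflow.execute_activity(
                "assign_nodes",  # placeholder activity name
                assigned,
                start_to_close_timeout=timedelta(seconds=10),
            )
            for n in assigned:
                self.state.nodes[n] = job_name
            return assigned
```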
A concern was raised that our samples should not be demonstrating any form of polling from within an entity workflow loop, even if the poll frequency is low. Instead, we intend to point to long-running activity patterns.
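A hedged sketch of that long-running-activity shape: the polling lives inside an activity that heartbeats, so the workflow never runs a timer loop. monitor_cluster_health and check_nodes are invented names for illustration:

```python
import asyncio

from temporalio import activity


@activity.defn
async def monitor_cluster_health(nodes: list[str]) -> None:
    """Poll node health from inside the activity, not the workflow."""
    while True:
        unhealthy = await check_nodes(nodes)  # hypothetical helper
        # Heartbeat so the server can detect a lost worker via heartbeat_timeout;
        # cancellation arrives here as asyncio.CancelledError.
        activity.heartbeat(unhealthy)
        await asyncio.sleep(10)


# From the workflow, start it once instead of polling in the main loop, e.g.:
#
#     workflow.start_activity(
#         monitor_cluster_health,
#         list(self.state.nodes),
#         start_to_close_timeout=timedelta(hours=1),
#         heartbeat_timeout=timedelta(seconds=30),
#     )
```

Retry policy and cancellation handling are elided; the point is only that the sleep/poll cycle happens in activity code, where heartbeats make it observable and cancellable.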